Despite what all the anti-“sportsball” folk would say, interest in sports across the country continues to grow. Baseball in particular is growing in international appeal, and the Olympic gold medal has not only gone to the United States but also Cuba, South Korea, and Japan in the last 6 Olympics. The more we can understand the trends of growth in American baseball, the better we can recognize the trends going on internationally right now. This baseball database will help us do that!
Grand Question 1
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
BYUI Players
# Include and execute your code herebatting = pd.read_sql_query('''SELECT CollegePlaying.playerID, schoolID, salary, CollegePlaying.yearID, teamID FROM CollegePlaying LEFT JOIN Salaries USING (playerID) WHERE schoolID="idbyuid" ORDER BY salary desc''', con)Markdown(batting.to_markdown(index=False))
playerID
schoolID
salary
yearID
teamID
lindsma01
idbyuid
4e+06
2001
CHA
lindsma01
idbyuid
4e+06
2002
CHA
lindsma01
idbyuid
3.6e+06
2001
BAL
lindsma01
idbyuid
3.6e+06
2002
BAL
lindsma01
idbyuid
2.8e+06
2001
COL
lindsma01
idbyuid
2.8e+06
2002
COL
lindsma01
idbyuid
2.3e+06
2001
CHA
lindsma01
idbyuid
2.3e+06
2002
CHA
lindsma01
idbyuid
1.625e+06
2001
HOU
lindsma01
idbyuid
1.625e+06
2002
HOU
stephga01
idbyuid
1.025e+06
1991
SLN
stephga01
idbyuid
1.025e+06
1992
SLN
stephga01
idbyuid
900000
1991
SLN
stephga01
idbyuid
900000
1992
SLN
stephga01
idbyuid
800000
1991
SLN
stephga01
idbyuid
800000
1992
SLN
stephga01
idbyuid
550000
1991
SLN
stephga01
idbyuid
550000
1992
SLN
lindsma01
idbyuid
410000
2001
FLO
lindsma01
idbyuid
410000
2002
FLO
lindsma01
idbyuid
395000
2001
FLO
lindsma01
idbyuid
395000
2002
FLO
lindsma01
idbyuid
380000
2001
FLO
lindsma01
idbyuid
380000
2002
FLO
stephga01
idbyuid
215000
1991
SLN
stephga01
idbyuid
215000
1992
SLN
stephga01
idbyuid
185000
1991
PHI
stephga01
idbyuid
185000
1992
PHI
stephga01
idbyuid
150000
1991
PHI
stephga01
idbyuid
150000
1992
PHI
catetr01
idbyuid
nan
2002
There aren’t many professional baseball players from BYU-Idaho, which isn’t too surprising given its past Division III status. I thought about filtering out Mr. Catetr, but with so few to even make it big I figured we’d let him take the credit.
Grand Question 2
A) Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
plot example
# Include and execute your code herebatting1 = pd.read_sql_query('''SELECT playerId, yearID, H, AB, (H+0.0)/(AB+0.0) AS "Batting Average" FROM Batting WHERE AB >= 1 ORDER BY H/AB desc, playerIDLIMIT 5 ''', con)Markdown(batting1.to_markdown(index=False))
playerID
yearID
H
AB
Batting Average
aberal01
1957
1
1
1
abernte02
1960
1
1
1
abramge01
1923
1
1
1
acklefr01
1964
1
1
1
alanirj01
2019
1
1
1
With so many players in the database, it’s not surprising that we have a few that happened to get on base in their only hit of the season.
B) Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
table example
# Include and execute your code herebatting2 = pd.read_sql_query('''SELECT playerId, yearID, (H+0.0) / (AB+0.0) AS "BattingAverage" FROM Batting WHERE AB >= 10 ORDER BY H/AB desc, playerID LIMIT 5''', con)Markdown(batting2.to_markdown(index=False))
playerID
yearID
BattingAverage
aaronha01
1954
0.279915
aaronha01
1955
0.313953
aaronha01
1956
0.328407
aaronha01
1957
0.321951
aaronha01
1958
0.326123
C) Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
Show the code
batting3 = pd.read_sql_query('''SELECT playerId, sum(H+0.0) / sum(AB) AS "Batting Average" FROM Batting GROUP BY playerIdHAVING sum(AB) >= 100ORDER BY "Batting Average" desc, playerID LIMIT 5''', con)Markdown(batting3.to_markdown(index=False))
playerID
Batting Average
cobbty01
0.366299
barnero01
0.359682
hornsro01
0.358497
jacksjo01
0.355752
meyerle01
0.355509
Grand Question 3
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph in Altair to visualize the comparison. What do you learn?
I’ve chosen to compare the season-by-season hits of the Chicago Cubs against the Seattle Mariners. I picked the Cubs because I grew up in Chicago and the Cubs are the only team I know about from Chicago, and the Seattle Mariners in honor of the greatest chart-driven documentary series of all time:
History of the Seattle Mariners
Each team is in its main color, with black circles representing the times they won their division and made it to the playoffs
table example
batting4 = pd.read_sql_query('''SELECT name, yearID, H, DivWinFROM teams WHERE name == "Seattle Mariners" OR name == "Chicago Cubs" AND yearID > 1978 ''',con)MarinersTeam = (alt.Chart(batting4[batting4.name=='Seattle Mariners']).encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),y = alt.Y('H',axis=alt.Axis(title='How many Hits?'),scale=alt.Scale(domain=(600,1800)))).properties(title='How do they do?').mark_line(color='#005C5C'))Playoff = (alt.Chart(batting4[batting4.DivWin =='Y']).encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),y = alt.Y('H',axis=alt.Axis(title='How many Hits?'),scale=alt.Scale(domain=(600,1800)))).properties(title='How do they do?').mark_circle(color='black'))CubsTeam = (alt.Chart(batting4[batting4.name=='Chicago Cubs']).encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),y = alt.Y('H',axis=alt.Axis(title='How many Hits per Season?'),scale=alt.Scale(domain=(600, 1800)))).properties(title='Both Teams are Middle of the Road').mark_line(color='red'))MarinersTeam + CubsTeam + Playoff
The big drops in 1981 and 1994 are the result of player strikes against the owners. Less games were played that season, which means all teams had less opportunity to get hits. Compared to previous seasons they look really bad, but compared to other teams they were about average. Outside of those strike years, there isn’t much connection between the two teams’s hits over the years. We can judge by the lack of division wins that they were both pretty mediocre through most of the 2010s, and the declining number of hits seems to match it as well.